Data Overview
Books
- Database Management Systems, 3rd Edition. Raghu Ramakrishnan and Johannes Gehrke.
Research and Industry
- Research 1: Distributed and Parallel Databases
- Research 2: Indexing and Physical Database Design I
- Research 3: Data Cleaning and Integration
- Research 4: Query Processing and Optimization
- Research 5: Social Networks and Graph Databases I
- Research 6: Data Visualization, Error Reporting
- Research 7: Storage Systems, Query Processing and Optimization
- Research 8: Data Streams and Sensor Networks
- Research 9: Mobile Databases
- Research 10: Data Analytics
- Research 11: Crowdsourcing, Uncertainty in Databases
- Research 12: Top-k Query Processing and Optimization
- Research 13: Temporal and Graph Databases
- Research 14: Information Retrieval and Text Mining
- Industry 1: Databases in the Cloud
- Industry 2: Social Media and Crowdsourcing
- Industry 4: Big Data
Business Intelligence field
http://www.douban.com/note/224220973/
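Looking at BI as a whole, there are roughly two ways to divide the field. The first divides it by analytical perspective: data analysis, market research, industry analysis, and so on. This approach starts from business needs and matches the long-term direction of BI. The second divides it by BI system architecture: data generation and acquisition, the relationships among enterprise information systems, information consulting, and analytical applications. This has been the main way of dividing BI in China over the past several years.
Although I agree that the first division is the direction of the future and the inevitable path to getting the most value out of BI analysis, for now the second division is the more common one when you look at how most companies define their staffing and organizational needs.
Under the second division, what do you need to know?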
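- Enterprise information system framework: This concept is not new. Many companies keep promoting their own frameworks and keep changing them, driven by organizational structure, management style, and the new management possibilities that technical innovation opens up. IBM, Teradata, and Oracle all promote one. This area is strongly industry-specific, and implementations are highly customized. Broadly speaking, though, an enterprise's information applications are composed of many systems. Keywords: SaaS, enterprise information bus, Data Architecture.
- Database technology: The representative vendors are the same few big names, but a series of new technologies in recent years means the required knowledge and skills also change quickly. Keywords: Hadoop, MapReduce, big data, Teradata, DB2, Oracle, Exadata.
- Servers: This area was very hot a few years ago. Anyone who had worked on Sun, HP, and IBM machines was in strong demand. Many Alibaba engineers did well elsewhere in earlier years partly because Alibaba's servers were so mixed that you had to work on all of them.
- Data processing: A large area that includes Data Governance & Stewardship, Data Quality, Data Standards, Master Data, Metadata, and Data Security. This is Teradata's strength, and no one in China matches it. Keywords: ETL, Teradata ETL Automation, Informatica PowerCenter, IBM InfoSphere DataStage.
- Data model layer: This is the core test of a person's data warehousing skill. A single Teradata model can sell for over a million, although in theory the newer big data approach overturns this layer. The field has two founding figures, Bill Inmon and Ralph Kimball. They are good friends in private, yet they have argued over their two theories for 30 years, and the joke is that the winner will be whoever outlives the other. Personally I side more with Kimball. Keywords: Teradata LDM, data model.
- Data analysis applications (to be covered in a separate post): The depth of use varies widely between companies. The area runs from monitoring to analysis and covers a great deal: basic statistical reports, multidimensional analysis, trend analysis, ad hoc queries, special-topic studies, and so on. Notable tools include Cognos and Runqian, though of course the best ones are Excel and PowerPoint.
- Data mining applications: Little needs to be said here. It varies considerably by industry and is widely discussed. Vendors include SAS and SPSS, and R is especially recommended. Applications range from the prediction and clustering that were common early on to the newer text mining, personalized recommendation, and user research.
- Marketing platforms: Data-driven application platforms, mainly for data-based customer analysis, channel management, content management, and marketing management. Keywords: Aprimo, Unica.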
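To make the data model layer and the multidimensional analysis mentioned in the list above a little more concrete, here is a minimal sketch of a Kimball-style star schema using Python's built-in sqlite3 module. The table names, column names, and sample rows are all hypothetical and are not taken from the original post; a real warehouse model would be far larger.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table and two dimension tables (all names hypothetical)."""
    conn.executescript("""
        CREATE TABLE dim_date (
            date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE fact_sales (
            date_id INTEGER REFERENCES dim_date(date_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            amount REAL);
    """)
    # Made-up sample rows, for illustration only.
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                     [(1, 2013, 11), (2, 2013, 12)])
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                     [(10, "books"), (20, "music")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 10, 100.0), (1, 20, 50.0),
                      (2, 10, 70.0), (2, 10, 30.0)])


def sales_by_month_and_category(conn):
    """Aggregate the fact table along two dimensions (month and category)."""
    return conn.execute("""
        SELECT d.year, d.month, p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_date AS d ON d.date_id = f.date_id
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY d.year, d.month, p.category
        ORDER BY d.year, d.month, p.category
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
rows = sales_by_month_and_category(conn)
for row in rows:
    print(row)
```

The fact table holds only measures and foreign keys, while descriptive attributes live in the dimension tables, so a report along a different dimension usually needs only a different GROUP BY clause.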
General Data Science
- Introduction to Data Science: Coursera course starting Jan 2014
- CS109 Data Science: The course is built around three modules: prediction and elections, recommendation and business analytics, and sampling and social network analysis
- Learn Data Science: Open content for self-directed learning in data science
- The Open-Source Data Science Masters - Curriculum: The Curriculum for learning Data Science, Open Source and at your fingertips.
- Introduction to Data Science: This book was developed for the Certificate of Data Science program at Syracuse University’s School of Information Studies.
- How to Prepare Data For Machine Learning: This blog post is a good primer on data preparation for analysis
- The Analytics Edge: edX MITx course related to Analytics. With real-life examples and exercises using R.
Data Sources
- Wolfram Alpha
- Jake Hofman Data Links
- Peter Skomoroch (Linkedin) Data Links
- Hilary Mason (bitly) Data Links
- Wikipedia Database
- IMDB Data
- Last.fm Database
- Quandl
- Datamob
- Factual
- Metro Boston Data Common
- Census.gov
- Data.gov
- Dataverse Network
- Infochimps
- Linked Data
- Guardian DataBlog
- Data Market
- Reddit Open Data
- Climate Data Sources
- Climate Station Records
- CDC Data
- World Bank Catalog
- Free SVG Maps
- Office for National Statistics
- StateMaster
Web Sites & Blogs
- Kaggle: https://www.kaggle.com/ A platform for predictive modelling and analytics competitions on which companies and researchers post their data, and statisticians and data miners from all over the world compete to produce the best models.
- Gapminder
- Flowing Data
- Information Aesthetics
- Many Eyes
- Visual Complexity
- TF Johnny’s Site (Sample Code)
- WTF Visualizations
- H2O: The open source prediction engine for big data science, with source on GitHub.